ABSTRACT


A student's academic achievement in college has a far-reaching
impact on his or her further development. With the rise of
ubiquitous sensing technology, students' digital footprints on
campus can be collected to gain insights into their daily
behaviours and to predict their academic achievements. In this
paper, we propose a framework named AAP-EDM (Academic Achievement
Prediction via Educational Data Mining) to predict students'
academic achievements based on the influencing factors we have
discovered. Multi-source heterogeneous data, including Wi-Fi
detection records, smartcard usage and campus network usage, is
aggregated first. Then, instead of self-reported features or
traditional academic assessments such as test scores, we extract
features reflecting students' behavioural patterns. In particular,
we define DOH (Degree of Hardworking) to improve the performance
of the classifier. Finally, we analyze the extracted features and
apply supervised learning methods to predict students' academic
achievements. Experiments are conducted on real-world data from
528 college students in one faculty, and the classification
accuracy reaches up to 88%.

1. INTRODUCTION


Predicting students' academic achievements is one of the most
popular applications in Educational Data Mining. One study
predicted students' academic achievements by analyzing static
information such as gender, character, eating habits and place of
residence [2]. Other authors used predictive modeling methods to
identify at-risk students in a course using standards-based
grading [5]. Another study found that students' achievements were
best inferred from their social ties, captured through modified
smartphones [4]. Researchers have also demonstrated the impact of
students' psychology in predicting their academic achievements,
using examination scores and information processing abilities as
features [3]. In the context of online learning, researchers
predicted the academic achievements of 145 students using their
online learning activities and online discussion forums [7, 8].
There are also authors who used passive sensing data and
self-reports from students' smartphones and proposed a model based
on linear regression with lasso regularization to predict GPA [9].


Our study addresses two shortcomings of previous studies. On the
one hand, compared with standard academic assessments or static
personal information, students' daily behaviours, which can be
monitored at any time, reflect their states of living and learning
more sensitively and in a more timely manner. Past research has
shown that students' academic achievements are related to their
daily behaviours [9]. We inspect students' behaviours by analyzing
their trajectories, class schedules, campus network usage and
smartcard usage. On the other hand, our study is based on a
completely passive detection system that requires no active
participation from students, which facilitates continual studies
at a larger scale [6, 10]. It is important to mention that we take
privacy protection seriously: all student information involved in
the study is anonymized.


In this paper, we propose a framework named AAP-EDM (Academic
Achievement Prediction via Educational Data Mining) to analyze
data generated on a digital campus in order to predict students'
academic achievements. The framework contains three main modules.
The first is multi-source heterogeneous data merging. After that,
we extract features such as wake-up time, duration of stay in the
dormitory, and class attendance. We discover the potential
influencing factors of academic achievements through the ANOVA
F-test and correlation coefficient analysis. Furthermore, we
define the feature DOH (Degree of Hardworking) to combine the
extracted features into a single comprehensive measure. Then, we
formalize the prediction as a binary classification problem to
identify students at risk, and choose the best solution from
multiple classification algorithms: SVM, Logistic Regression,
Naive Bayes and Decision Tree. Finally, we evaluate the proposed
framework on a real-world dataset involving 528 undergraduates,
and find that the classification accuracy reaches up to 88%.


Our main contributions in this paper are listed below: 


(1) We predicted students’ academic achievements utilizing 
students’ daily life behaviour data rather than using aca- 
demic assessments such as test scores. The high accuracy 
rate indicates that students’ academic achievements have 
strong relationships with their daily behaviours. 


(2) We extracted abundant features which describe students' daily
life in detail, and also defined the DOH, which improves the
performance of classifiers.


(3) In order to explore students' behaviour patterns extensively,
we developed methods to fuse the multi-source heterogeneous data
of college students. Our research can be easily extended to a much
larger scale.


2. PROBLEM FORMULATION 


Our raw data consists of four components. First, students' usage
of the campus network is monitored in real time. Second, when
students use their smartcards on campus, for example when having
breakfast or shopping, these behaviours are also captured. Third,
through the Wi-Fi monitors we deployed at the entrances of
particular places on campus, Wi-Fi packets from students'
Wi-Fi-enabled smartphones can be captured when they pass by the
monitors, without connecting to the network. Besides these three
parts, we have static data including students' class schedules and
academic achievements. We introduce the data set in detail in the
next section. Based on the data, our target is to extract features
of students and train models using supervised learning algorithms
to predict academic achievements.

Formally, given the input matrix $X \in \mathbb{R}^{M \times N}$,
where $M$ is the total number of students and $N$ is the number of
features (introduced later), and the academic achievement label
matrix $Y \in \mathbb{R}^{M \times 1}$, our target is to learn the
function $f$ which satisfies $Y = f(X)$. Note that the labels in
our study are either 0 or 1, where 0 represents good performers
and 1 represents students at risk.


Figure 1: Overview of the framework


Table 1: Data format of Wi-Fi detection records.

MAC Address   Time                RSSI   Location
38BC**91      20160301 12:20:23   -70    Canteen #1


3. METHODS 


In this section, we introduce our framework AAP-EDM in detail. The
framework mainly consists of multi-source heterogeneous data
merging, feature extraction and academic achievement prediction,
as illustrated in Figure 1.


3.1 Multi-source Heterogeneous Data Merging

3.1.1 Raw Data Set 

The raw data set contains Wi-Fi detection records, usage of the
campus network, usage of smartcards, class schedules and students'
academic achievements.


By deploying Wi-Fi monitors at the entrances of locations such as
dormitories, canteens and teaching buildings, we can detect
smartphones' MAC addresses, which provides a coarse-grained
location trace for students who enter the coverage area of the
monitors. The record format is shown in Table 1.


Students' campus network usage records are shown in Table 2.
Specifically, the building-level location where a student accesses
the network can be inferred from the "IP Address", and "Network
Traffic" is the traffic between login time and logout time in MB,
including both uplink and downlink traffic.


The information about students' devices connected to the campus
network is shown in Table 3. In the table, "Device Type" helps us
distinguish mobile devices from PCs, and "Time" is recorded at day
granularity rather than in seconds as in Table 1.


Table 4 shows the usage of smartcards. The "Consumption Type"
includes "Repast", "Shopping", "Bathing", "Network cost" and so
on. Note that the consumption type reveals the location where
students spend with their smartcards.


Table 2: Data format of usage of campus network

Anonymous ID   Login/logout Time                       IP Address (Location)   Network Traffic
E416**B2ED     20160301 08:00:00 / 20160301 09:00:00   10.210.**.***           200


Table 3: Data format of device information

IP Address (Location)   Device Type   Time       MAC Address
10.210.**.**            Mobile        20160301


Other than the data mentioned above, in this paper we also 
utilize students’ class schedules to analyze students’ class 
attendance and utilize students’ academic achievements to 
train the classification model. 


3.1.2 Trajectory Generation 

We arranged the campus network usage records, the smartcard usage
records and the Wi-Fi detection records in chronological order to
form students' semantic trajectories. In particular, we consider a
student to stay at a location for the period between login time
and logout time in the campus network records, until a record is
captured at another location. An example semantic trajectory is
shown in Table 5.
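
The merging step can be sketched as follows. This is a minimal
illustration rather than our actual implementation: the function
name `build_trajectory`, the (timestamp, location) tuple layout,
the use of the login time as the timestamp of a network record,
the collapsing of consecutive records at the same location, and
the 07:40:00 smartcard record are all assumptions made for the
example.

```python
from datetime import datetime


def build_trajectory(wifi, smartcard, network):
    """Merge one student's records into a time-ordered semantic trajectory.

    Each argument is a list of (timestamp, location) tuples. For network
    records the login time is used as the timestamp (an assumption).
    """
    points = sorted(wifi + smartcard + network, key=lambda p: p[0])
    trajectory = []
    for ts, loc in points:
        # Keep a point only when the location changes (an assumption).
        if not trajectory or trajectory[-1][1] != loc:
            trajectory.append((ts, loc))
    return trajectory


def at(hms, day="20160301"):
    """Parse a time of day on the given day (helper for the example)."""
    return datetime.strptime(f"{day} {hms}", "%Y%m%d %H:%M:%S")


wifi = [(at("07:33:14"), "Canteen #1"), (at("08:21:52"), "Teaching Building #3")]
smartcard = [(at("07:40:00"), "Canteen #1")]   # illustrative record
network = [(at("07:30:00"), "Dormitory #13")]
trajectory = build_trajectory(wifi, smartcard, network)
```

The resulting list has the same (time, location) shape as the rows
of Table 5.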


3.2 Feature Extraction 
3.2.1 Trajectory Features 


Daily wake-up time: Wake-up time reflects the degree of diligence
to a certain extent. It is calculated as the first time in a day
when a student logs in to the network in his or her dormitory.


Daily time of return to dormitory: Returning to dor- 
mitories at a later time in the evening usually means longer 
periods students spend in the classrooms or the library. We 
regard the last time in a day when a student logs in to the 
network in his dormitory as the time of return to dormitory. 


Daily duration spent in the dormitory: Dormitories are usually not
appropriate places for studying. We estimate the time spent in the
dormitory from the times at which students enter and leave the
dormitory. In particular, only the time between 06:30 and 23:30 is
taken into consideration.
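
A minimal sketch of these three dormitory features is given below.
The function names, the representation of dormitory stays as
(enter, leave) pairs and the example timestamps are assumptions
made for illustration; only the 06:30 to 23:30 window and the
first/last-login definitions come from the text.

```python
from datetime import datetime, timedelta


def at(hms, day="20160301"):
    """Parse a time of day on the given day (helper for the example)."""
    return datetime.strptime(f"{day} {hms}", "%Y%m%d %H:%M:%S")


def wake_up_and_return(dorm_logins):
    """First and last dormitory network login of one day."""
    return min(dorm_logins), max(dorm_logins)


def dormitory_duration(stays, window=("06:30:00", "23:30:00")):
    """Total time in the dormitory, counting only the 06:30-23:30 window.

    stays: list of (enter, leave) datetime pairs within a single day.
    """
    total = timedelta()
    for enter, leave in stays:
        day = enter.strftime("%Y%m%d")
        lo, hi = at(window[0], day), at(window[1], day)
        start, end = max(enter, lo), min(leave, hi)
        if end > start:
            total += end - start
    return total


# Illustrative login times and stays, loosely following Table 5.
logins = [at("07:30:00"), at("12:55:00"), at("22:20:00")]
wake_up, return_time = wake_up_and_return(logins)
stays = [
    (at("00:00:00"), at("07:33:14")),
    (at("12:50:58"), at("18:03:58")),
    (at("18:35:34"), at("20:39:16")),
    (at("22:15:15"), at("23:59:59")),
]
duration = dormitory_duration(stays)
```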


Table 4: Data format of usage of smartcard

Anonymous ID   Time                Cost   Consumption Type (Location)
E416**B2ED     20160301 08:00:00   5.0    Repast


Table 5: Example of a semantic trajectory in one day

Id   Time       Location
1    07:30:00   Dormitory #13
2    07:33:14   Canteen #1
3    08:21:52   Teaching Building #3
4    11:49:39   Canteen #2
5    12:50:58   Dormitory #13
6    18:03:58   Canteen #2
7    18:35:34   Dormitory #13
8    20:39:16   Teaching Building #2
9    22:08:56   Super Market
10   22:15:15   Dormitory #13


Class Attendance: Given the daily trajectory
$\{p_0 \to p_1 \to \dots \to p_n\}$, where $p_i = (loc, time)$,
and the start time $t_s$ and end time $t_e$ of the course
according to class schedules, we judge whether a student attends
the class. Considering that students must appear in the classroom
and should not have any records in other, irrelevant places during
the class, we propose a method based on two conditions. Eq. 1
ensures that students have no records outside the classroom during
the class period. Eq. 2 ensures that students are indeed in the
classroom.

$$\{p \mid t_s + \Delta t < time < t_e - \Delta t,\ loc \neq classroom\} = \emptyset \qquad (1)$$

$$\{p \mid t_s - \Delta t < time < t_e + \Delta t,\ loc = classroom\} \neq \emptyset \qquad (2)$$
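
The two conditions can be checked directly on a trajectory, as in
the sketch below. The function name `attended`, the 10-minute
default for the tolerance $\Delta t$, the hypothetical 08:30 to
11:30 class and the building-level matching of the classroom are
assumptions made for illustration.

```python
from datetime import datetime, timedelta


def attended(trajectory, t_s, t_e, classroom, delta=timedelta(minutes=10)):
    """Apply the two attendance conditions to one class on one day.

    trajectory: list of points p = (loc, time).
    delta: tolerance (Delta t); the 10-minute default is an assumption.
    """
    # Eq. 1: no record outside the classroom inside the class period.
    elsewhere = [p for p in trajectory
                 if t_s + delta < p[1] < t_e - delta and p[0] != classroom]
    # Eq. 2: at least one record in the classroom around the class period.
    present = [p for p in trajectory
               if t_s - delta < p[1] < t_e + delta and p[0] == classroom]
    return not elsewhere and bool(present)


def at(hms, day="20160301"):
    """Parse a time of day on the given day (helper for the example)."""
    return datetime.strptime(f"{day} {hms}", "%Y%m%d %H:%M:%S")


# Points taken from Table 5; the class time below is hypothetical.
trajectory = [
    ("Canteen #1", at("07:33:14")),
    ("Teaching Building #3", at("08:21:52")),
    ("Canteen #2", at("11:49:39")),
]
present = attended(trajectory, at("08:30:00"), at("11:30:00"),
                   "Teaching Building #3")
# Adding a record elsewhere in the middle of the class violates Eq. 1.
left_early = attended(trajectory + [("Super Market", at("10:00:00"))],
                      at("08:30:00"), at("11:30:00"), "Teaching Building #3")
```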


Days outside of campus: Students who have no digital footprints on
a given day are considered to be off campus that day. We expect
students' academic achievements to be affected if they are often
off campus.


3.2.2 Network Features 

Daily Network Traffic in Dormitory: We sum up the network traffic
that students upload and download in their dormitories. The
network traffic in teaching buildings is much smaller than in
dormitories, so we do not take it into consideration.


Network Cost: Students do not need to pay for the campus network
until their traffic exceeds the monthly upper limit. The upper
limit is almost always enough for normal usage, so students who
exceed it may spend too much time on the internet watching online
videos or playing online games. We calculate the total network
charges of each student.


Network top up Frequency: When the balance of a student's network
account reaches zero, the student must top up the account to
continue using the network. We use the frequency of these top-ups
as a feature.


Daily Network Traffic Peak: The daily network traffic peak is
represented as $L = \{l_0, l_1, \dots, l_{23}\}$, where $l_n$
corresponds to the $n$-th hour of the day and takes the value 0 or
1 as given in Eq. 3, where $traffic_n$ is the traffic during the
$n$-th hour and $average$ is the average traffic per hour in that
day.

$$l_n = \begin{cases} 1, & traffic_n \geq average \\ 0, & traffic_n < average \end{cases} \qquad (3)$$


Figure 2: Probability density function of DOH
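
The daily network traffic peak vector of Eq. 3 is simple to
compute. The sketch below assumes the traffic has already been
aggregated into 24 hourly totals; the function name and the
example values are hypothetical.

```python
def traffic_peak(hourly_traffic):
    """Return the 24-element 0/1 vector L of Eq. 3 for one day."""
    if len(hourly_traffic) != 24:
        raise ValueError("expected 24 hourly values")
    average = sum(hourly_traffic) / 24
    # An hour is a peak hour when its traffic reaches the daily average.
    return [1 if t >= average else 0 for t in hourly_traffic]


hourly = [0.0] * 24
hourly[21], hourly[22] = 300.0, 180.0   # MB, made-up values
peak = traffic_peak(hourly)
```

Note that on a day without any traffic the average is zero, so
every hour satisfies the condition of Eq. 3; such days may need to
be excluded beforehand.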


3.2.3 Smartcard Features 

Students' consumption patterns are captured from the usage of
smartcards. On campus, students use their smartcards when having
meals in the canteens, shopping in the supermarket and taking
showers in the bathhouse. We calculate students' daily costs and
frequency of consumption for breakfast, shopping and bathing.


3.2.4 Self-defined Features 


In order to obtain a comprehensive evaluation of all the features
extracted above, we calculate a score for each feature of each
student according to Eq. 4. $Corr(X_k)$ is the Pearson correlation
coefficient between the $k$-th feature $X_k$ and students'
academic achievements, as shown in Table 6. Note that the academic
achievements are in the form of rankings when calculating the
Pearson correlation coefficient. $Rank(x_k^n)$ is the ranking of
student $u_n$'s $k$-th feature among the $N$ students. For
example, if there are three students $(u_1, u_2, u_3)$ whose
$i$-th feature (class attendance) is $(0.8, 0.5, 0.6)$, we have
$Score_i^1 = 1$, $Score_i^3 = 0.66$ and $Score_i^2 = 0.33$,
because $Corr(X_i) < 0$ according to Table 6.


Then we define the degree of hardworking (DOH) from the feature
scores according to Eq. 5, where $K$ is the number of features we
have extracted. We plot the probability density function of DOH
(Min-Max normalized) for three groups of students separated by
their rankings of academic achievements, as shown in Figure 2.
From the figure we can see that the distributions of DOH are close
to normal distributions, with means of approximately 0.2, 0.5 and
0.8. The clear separation among the three groups suggests that our
defined feature is a strong factor for prediction. Essentially,
the DOH is a weighted sum of feature scores in which the weights
are the absolute values of the correlation coefficients. Besides
DOH, the self-defined features also include other statistics of
the feature scores, such as the average and median.


$$Score_k^n = \begin{cases} (N - Rank(x_k^n))/N, & Corr(X_k) > 0 \\ Rank(x_k^n)/N, & Corr(X_k) < 0 \end{cases} \qquad (4)$$

$$DOH^n = \sum_{k=1}^{K} \left( |Corr(X_k)| \cdot Score_k^n \right) \qquad (5)$$
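
The feature scores and the DOH can be sketched as follows. This is
an illustration under stated assumptions, not our exact
implementation: we assume that rankings are ascending (rank 1 for
the smallest value), that the score for a positively correlated
feature is (N - Rank)/N and for a negatively correlated feature is
Rank/N, and that a zero correlation is treated like the negative
case. The function names and the dormitory hours are made up; the
attendance values and the two correlation coefficients come from
the text and Table 6.

```python
def ascending_ranks(values):
    """Rank 1 for the smallest value, N for the largest (ties by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def feature_scores(values, corr):
    """Score of one feature for every student; values[n] belongs to student n."""
    n = len(values)
    ranks = ascending_ranks(values)
    if corr > 0:
        return [(n - r) / n for r in ranks]
    # corr < 0 (corr == 0 is treated the same way here, as an assumption).
    return [r / n for r in ranks]


def doh(features, corrs):
    """DOH of every student: sum over features of |corr| times the score."""
    scores = [feature_scores(v, c) for v, c in zip(features, corrs)]
    n_students = len(features[0])
    return [sum(abs(c) * s[n] for c, s in zip(corrs, scores))
            for n in range(n_students)]


attendance = [0.8, 0.5, 0.6]     # example values from the text
dorm_hours = [6.0, 11.0, 8.0]    # made-up values
scores = feature_scores(attendance, corr=-0.430)
doh_values = doh([attendance, dorm_hours], corrs=[-0.430, 0.565])
```

With these inputs the student with the highest attendance and the
shortest dormitory stay receives the highest DOH.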


Figure 3: ANOVA F-test for binary classification (bar chart of
F-values, axis from 0 to 120, for features grouped into smartcard,
network, self-defined and trajectory features)


3.3 Academic Achievement Prediction 

We separate the whole semester into four periods: the first three
periods last for four weeks each and the last one lasts for six
weeks. We calculate the mean of the daily features separately for
each of the four periods, because students' behaviours may change
over the semester and have different impacts on their academic
achievements. Moreover, it is necessary to distinguish weekdays
from weekends in each period, since their behavioural patterns
differ.
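
The aggregation into periods can be sketched as below. The
function names and the daily values are hypothetical, and we
assume that periods are counted in whole weeks from the first day
of the semester (Feb. 29th, 2016, see Section 4.1); days outside
the four periods are ignored.

```python
from datetime import date, timedelta
from statistics import mean

SEMESTER_START = date(2016, 2, 29)   # first day of the semester (Section 4.1)
PERIOD_WEEKS = [4, 4, 4, 6]          # lengths of the four periods in weeks


def period_of(day):
    """Index (0-3) of the period containing `day`, or None if outside."""
    week = (day - SEMESTER_START).days // 7
    if week < 0:
        return None
    upper = 0
    for idx, length in enumerate(PERIOD_WEEKS):
        upper += length
        if week < upper:
            return idx
    return None


def period_means(daily_values):
    """Mean of a daily feature per (period, 'weekday' or 'weekend').

    daily_values: dict mapping a date to the feature value on that day.
    """
    groups = {}
    for day, value in daily_values.items():
        period = period_of(day)
        if period is None:
            continue
        kind = "weekend" if day.weekday() >= 5 else "weekday"
        groups.setdefault((period, kind), []).append(value)
    return {key: mean(values) for key, values in groups.items()}


# Made-up dormitory hours: 8 h on weekdays, 12 h on the first weekend.
daily = {SEMESTER_START + timedelta(days=i): (8.0 if i < 5 else 12.0)
         for i in range(7)}
daily[date(2016, 3, 28)] = 6.0       # a Monday in the second period
means = period_means(daily)
```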


The academic achievement prediction is essentially a binary
classification problem which can be used for academic early
warning. Because the feature values vary greatly in scale, we
limit all feature values to the range of 0 to 1 using Min-Max
normalization in order to increase the speed of gradient descent
and the accuracy of the classifiers. We label the 100 students who
performed worst according to their school reports as positive and
the other 428 students as negative. The dataset is split into a
training set and a test set with a ratio of 7:3.
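
A minimal sketch of the normalization and the 7:3 split is given
below. The function names, the random seed and the handling of
constant features are assumptions; whether the actual split was
stratified by label is not specified here, so the sketch uses a
plain random split.

```python
import random


def min_max_normalize(column):
    """Scale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0 (assumption).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def split_7_3(n_samples, seed=0):
    """Return shuffled (train, test) index lists with a 7:3 ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * 0.7)
    return indices[:cut], indices[cut:]


labels = [1] * 100 + [0] * 428        # 100 at-risk, 428 good performers
train_idx, test_idx = split_7_3(len(labels))
normalized = min_max_normalize([5.0, 7.5, 10.0])
```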


There may be correlations among features which decrease the
performance of classifiers. For example, students who spend a long
time on the campus network are likely to incur high network
charges. In this work, we apply Principal Component Analysis
(PCA), a widely used technique, to address this problem.


We trained various classification models such as Logistic Re- 
gression, Support Vector Machine, Naive Bayes and Decision 
Tree using cross-validation and evaluated on the test set. 
Moreover, we implemented the voting classifier to combine 
conceptually different machine learning classifiers and use a 
majority vote to predict the class labels. 


4. EXPERIMENTAL RESULTS 


4.1 Experimental Data 

We collected a total of 1,673,706 records from 528 third-year
undergraduates in 19 classes of one faculty. The selected period
covers a complete semester of 140 days, from Feb. 29th, 2016 to
Jul. 17th, 2016. The academic achievements are the weighted
average scores of all courses in the semester, which include
quizzes, midterms and finals.


Table 6: Correlation Coefficient and P-value

Feature                                 Correlation coefficient   P-value
Class attendance                        -0.430                    3.39e-25
Time spent in dormitory (Weekday)       0.565                     7.71e-46
Time spent in dormitory (Weekend)       0.411                     5.84e-23
Time of return to dormitory (Weekday)   -0.394                    4.22e-21
Time of return to dormitory (Weekend)   -0.348                    1.60e-16
Wake-up time (Weekday)                  0.222                     2.69e-7
Wake-up time (Weekend)                  0.204                     2e-6
Shopping cost                           0.215                     6.09e-7
Breakfast frequency                     -0.337                    1.9e-15
Breakfast cost                          -0.266                    5.55e-10
Days out of campus                      0.068                     0.117
Network traffic                         0.406                     2.11e-22
Network cost                            0.362                     8.3e-18
Network top up frequency                0.361                     1.02e-17
Feature score average                   -0.551                    3.3e-43
Feature score median                    -0.547                    1.64e-42
DOH                                     -0.561                    3.84e-45


Figure 4: Features statistics among different grades. Panels: (a)
Duration in Dormitory, (b) Class Attendance, (c) Daily Network
Traffic, (d) Returning Dormitory Time. Vertical axis: Number of
Students.


We emphasize that our experiment was conducted entirely on
anonymized data: students' IDs, which reveal their true
identities, were mapped to anonymous IDs.


4.2 Feature Analytics 

It is meaningful for educators to find out how students'
behaviours influence academic achievements. We performed
correlation coefficient analysis and the ANOVA F-test to compare
the contributions of different features to students' study. The
correlation coefficients are shown in Table 6. Note that time
spent in the dormitory on weekdays reaches the highest absolute
value of 0.565 with the smallest P-value, which is a novel
finding. Many previous studies have shown that class attendance is
a significant and positive predictor of academic achievements,
which also holds in our study. The self-defined features also
reach high correlation coefficients, which is confirmed by the
ANOVA F-values for binary classification shown in Figure 3. Thus
our proposed method for constructing new features is effective and
can be expected to improve the performance of the prediction.
Other than the self-defined features, the overall F-values of
network features are relatively high, while the smartcard features
are only weakly relevant. Note that the wake-up time and the days
out of campus, which do not achieve sufficient significance
(p > 0.001), are omitted from Figure 3.
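
For reference, the F-value of a one-way ANOVA for a single feature
can be computed as in the sketch below, where the two groups are
the feature values of the negative and positive samples. The
function name and the example values are hypothetical, and the
sketch does not compute the corresponding p-value.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F-value for a list of groups of feature values."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    # Between-group and within-group sums of squares.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (between / df_between) / (within / df_within)


good = [5.0, 6.0, 7.0]        # made-up feature values, label 0
at_risk = [9.0, 10.0, 11.0]   # made-up feature values, label 1
f_value = anova_f([good, at_risk])
```

A larger F-value means that the feature separates the two classes
more clearly relative to the variation inside each class.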


To observe the differences in behaviours among students in detail,
we display the distributions of four features which are highly
correlated with academic achievements in Figure 4. We divide all
the students into four groups in the order of their academic
achievements. Group A represents the best performers and group D
represents the worst performers.


As we can see in Figure 4a, more than 70% of the students in group
A spend less than 7 hours in dormitories. In contrast, most
students in groups C and D stay in dormitories for longer than 7
hours, and some even stay for more than 10 hours. In Figure 4b, we
find that class attendance is mainly distributed from 0.5 to 0.75,
except for group D, in which the attendance of more than 60% of
the students is less than 0.5. Nearly 90% of the students in group
A have a high attendance rate. Whether class attendance influences
academic achievements is controversial [9, 1]. We find that it is
a relatively strong factor in our research. Daily network traffic
is shown in Figure 4c. More than 90% of the students in group A
use less than 1 GB of traffic daily. Poor performers may spend
more time on online gaming and movies, which results in more
network traffic. Figure 4d shows students' time of return to
dormitory. The two groups of data on the left tend to show an
ascending trend while those on the right show a descending trend,
which indicates that most students in groups A and B come back to
dormitories after 21:00 and are therefore more diligent.


Table 7: Classification Results

Model          Class 0     Class 0   Class 0    Class 1     Class 1   Class 1    Accuracy
               Precision   Recall    F1-score   Precision   Recall    F1-score
SVM            0.92        0.86      0.89       0.55        0.69      0.61       0.82
SVM(PCA)       0.87        0.97      0.92       0.78        0.44      0.56       0.86
LR             0.92        0.77      0.84       0.45        0.75      0.56       0.77
LR(PCA)        0.88        0.97      0.92       0.79        0.47      0.59       0.87
NB             0.92        0.71      0.80       0.39        0.75      0.52       0.72
NB(PCA)        0.87        0.96      0.91       0.72        0.41      0.52       0.85
DT(PCA)        0.91        0.93      0.92       0.73        0.69      0.71       0.87
SVM+LR(PCA)    0.94        0.91      0.92       0.69        0.75      0.72       0.88


Figure 5 shows the distribution of students' daily network rush
hours in one month. The horizontal axis represents the 24 hours of
a day. The vertical axis represents the students in a specific
group according to academic achievements. Each student is
represented by a row vector $v \in \mathbb{R}^{24}$ accumulated
over one month according to Eq. 3. The color bar shows the values
in the vectors, which lie between 0 and 30 (30 days in one month).
Therefore, brighter areas mean that students frequently spend more
time online during those hours. From the figure we can see that
students in groups A and B have a shorter span of rush hours and
usually log in to the network around 22:00 after they come back
from classrooms, while the rush hours of students in groups C and
D last longer, from about 15:00 to 23:00.
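
The monthly row vector of each student is the element-wise sum of
the daily 0/1 peak vectors. A minimal sketch, with a hypothetical
function name and made-up daily vectors:

```python
def monthly_rush_hours(daily_peaks):
    """Sum daily 0/1 peak vectors element-wise into one 24-element vector."""
    totals = [0] * 24
    for peak in daily_peaks:
        for hour, flag in enumerate(peak):
            totals[hour] += flag
    return totals


# Made-up month: 20 days with a peak at 21:00-22:59, 10 days without peaks.
evening = [1 if 21 <= h <= 22 else 0 for h in range(24)]
quiet = [0] * 24
v = monthly_rush_hours([evening] * 20 + [quiet] * 10)
```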


4.3 Results of Prediction 


In our research the prediction task is an imbalanced
classification problem. According to students' academic
achievements, the dataset is composed of 428 good performers
(negative samples) and 100 poor performers (positive samples). We
applied four supervised learning algorithms: Support Vector
Machine, Logistic Regression, Decision Tree and Naive Bayes. The
highest classification accuracy is 88%. However, for imbalanced
classification problems it is not convincing to inspect the
classification accuracy alone. In this paper, we therefore also
use precision, recall and F1-score to evaluate the performance of
our models. The average classification results of 10-fold
cross-validation are shown in Table 7. In particular, we ensemble
the Support Vector Machine and Logistic Regression through a
voting classifier and achieve the highest accuracy of 88%. The
rule of the voting classifier is that a student is classified as a
negative sample when the two classifiers disagree.


5. CONCLUSIONS 


In this paper, we predicted students' academic achievements in
order to identify students who perform poorly in their studies,
based on our proposed framework AAP-EDM. First, multi-source
heterogeneous data is merged to generate semantic trajectories.
Then we extract trajectory features, network features and
smartcard features. Furthermore, self-defined features are
proposed to combine the extracted features comprehensively.
Finally, we evaluated the framework with multiple classification
models on students' real-world data. The results show that our
proposed framework is feasible and meaningful for educational
supervision and early warning. Our research provides promising
approaches for moving college education from traditional
descriptive analytics to predictive analytics. We will improve our
framework through further research and concentrate on realizing
prescriptive analytics in college education.


Figure 5: Daily network rush hours (one panel per student group,
including (d) Group D; horizontal axis: Hour)